{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "#r \"nuget: TorchSharp-cuda-windows\"\n",
    "\n",
    "open TorchSharp\n",
    "open type TorchSharp.torch\n",
    "open type TorchSharp.TensorExtensionMethods"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using CUDA\n",
    "\n",
    "This tutorial is the only one that does not use the 'TorchSharp-cpu' package. When you have a machine with a GPU that supports CUDA programming, you can use either 'TorchSharp-cuda-windows' or 'TorchSharp-cuda-linux' depending on your operating system. There is no CUDA distribution for MacOS. \n",
    "\n",
    "Using CUDA, especially for training, can boost the performance significantly, typically a couple of orders of magnitude. It may be the difference between model training being feasible and not.\n",
    "\n",
    "Note: The tutorials won't require much in terms of capabilities, but for training real vision models with even modest data sizes, you need at least 8MB of dedicated GPU memory. Even something as simple as CIFAR10 (in the Examples solution in this repo) requires that much memory in order not to blow up. 6MB, a common memory size on Nvidia-enabled laptops, is not enough."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the right backend package installed, using CUDA is very straight-forward. TorchSharp lets you create and use tensors on the GPU as easily as on the CPU. Previously, we haven't use the 'device' argument when creating tensors, but it's easy to use. Notice that the string representation looks difference -- it says 'cuda:0' instead of 'cpu' for the device."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "torch.ones(3,4, device=torch.CUDA)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have more than one GPU, you can target a specific one by creating a device object to represent each one. The static `torch.CUDA` instance defaults to the first enumerated GPU.\n",
    "\n",
    "Using GPU tensor is just like CPU tensors, there really is no difference. You cannot mix and match, though. Almost all algorithms require operands to be on the same device or to be moved there first. When you run the cell below, you should see the first addition print out a tensor filled with '2', then get an exception from the second addition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let a = torch.ones(3,4, device=torch.CUDA)\n",
    "let b = torch.ones(3,4, device=torch.CUDA)\n",
    "let c = torch.ones(3,4, device=torch.CPU)\n",
    "\n",
    "(a + b).print()\n",
    "(a + c).print()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copying a tensor from device to device is very straight-forward. Note that applying 'to()' to a tensor does not move it. Because of this inefficiency, it is better to create a tensor where it belongs, than to first create it on the CPU and move it over later. That said, it is often necessary to copy input data, because it comes from files and in-memory data that can only be turned into tensors on a CPU.\n",
    "\n",
    "There are two primary ways to copy a tensor to the GPU -- using 'cuda()' or using 'to()' The former is simpler, while the latter is more flexible -- it allows you to simultaneously convert the element type of a tensor and copy it over. Unfortunately, using 'to()' requires identifier escape in F#."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let a = torch.ones(3,4, device=torch.CUDA)\n",
    "let b = torch.ones(3,4, device=torch.CUDA)\n",
    "let c = torch.ones(3,4, device=torch.CPU)\n",
    "\n",
    "(a + c.cuda()).print()\n",
    "c.print()\n",
    "\n",
    "// or:\n",
    "\n",
    "(a + c.``to``(device=torch.CUDA)).print()\n",
    "c.print()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GPU Memory Management\n",
    "\n",
    "Before moving on, it is important to discuss explicit memory management of tensors. Because CUDA does not have any virtual memory mechanism, it is easy to run out of GPU memory unless it is carefully managed.\n",
    "\n",
    "TorchSharp tensors are eventually garbage-collected, which is triggered when the heap is starting to get full. However, the heap is all CPU memory, and only what is seen by the managed runtime. The storage for tensors are allocated in native code and doesn't therefore increase the memory pressure that triggers GC. This is particularly precarious for GPU memory.\n",
    "\n",
    "Therefore, the tensor class implements IDisposable, so that you can manually free memory.\n",
    "\n",
    "TorchSharp arithmetic results in a lot of temporaries, which need to be freed when no longer used. Consider this expression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "(a + b) * (a + c.cuda())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are three temporaries here, none of which are seeing their native memory freed. To demonstrate, let's pull all the temporaries out into explicit variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let t0 = a + b\n",
    "let t1 = c.cuda()\n",
    "let t2 = a + t1\n",
    "let t3 = t0 * t2\n",
    "\n",
    "t3"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To deal with this systematically, we need to choose 'use' instead of 'let' for all the temporaries. \n",
    "\n",
    "Note that you cannot have a 'use' at the outermost level of a notebook cell, so to demonstrate, we'll scope it to a 'do' block:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "do \n",
    "    use t4 = a + b\n",
    "    use t5 = c.cuda()\n",
    "    use t6 = a + t1\n",
    "    let t7 = t0 * t2\n",
    "\n",
    "    t7.print() |> ignore"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's worth noting that use of in-place operators cuts down significantly the number of temporaries that have to be disposed."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DisposeScope\n",
    "\n",
    "In version 0.95.4, TorchSharp introduced the notion of a `DisposeScope`, which deals with the dispose pattern systematically. It introduces the notion of a dynamic (runtime) scope, which controls the liftime of all tensors created while the scope is in effect. The lexical scope, i.e. the source code location where variables are held, etc., had no impact on the dynamic scope management.\n",
    "\n",
    "When any .NET tensor is created, it will be registered with the current dynamic scope, if there is one. Once registered, the tensor will be disposed automatically when the scope is disposed. It doesn't matter if the tensor is held in a variable declared outside or inside the scope.\n",
    "\n",
    "Try running the next cell with and without the 'use d = ...' line (comment it out), and notice how the tensor count grows if you don't have it, but stays constant when you do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "printf \"%d\\n\" torch.Tensor.TotalCount\n",
    "do \n",
    "    use d = torch.NewDisposeScope()\n",
    "    let t3 = (a + b) * (a + c.cuda())\n",
    "    t3.print() |> ignore\n",
    "\n",
    "printf \"%d\\n\" torch.Tensor.TotalCount"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes, you really need a tensor instance to survive the end of the dynamic scope. It can be detached from the scope, or moved to a surrounding scope (they nest). If there is no surrounding scope, it's the same thing.\n",
    "\n",
    "For example, let's try this example -- you should get an exception that complains about an invalid tensor. It's because it was disposed before calling `print()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let mutable t3:torch.Tensor = null\n",
    "\n",
    "do \n",
    "    use d = torch.NewDisposeScope()\n",
    "    t3 <- (a + b) * (a + c.cuda())\n",
    "\n",
    "t3.print()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The solution is to detach it before the end of the scope. Once it's been moved outside, the tensor needs to either land in an outer scope that will automatically dispose it, or it needs to be disposed explicitly in order to free native memory (unless waiting until GC kicks in is acceptable)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let mutable t3:torch.Tensor = null\n",
    "\n",
    "do \n",
    "    use d = torch.NewDisposeScope()\n",
    "    t3 <- d.MoveToOuter((a + b) * (a + c.cuda()))\n",
    "\n",
    "t3.print()\n",
    "printf \"%b\\n\" t3.IsInvalid\n",
    "t3.Dispose()\n",
    "printf \"%b\\n\" t3.IsInvalid"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "dotnet_interactive": {
     "language": "fsharp"
    },
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },
   "outputs": [],
   "source": [
    "let mutable t3:torch.Tensor = null\n",
    "\n",
    "do\n",
    "    use d0 = torch.NewDisposeScope()\n",
    "\n",
    "    do \n",
    "        use d1 = torch.NewDisposeScope()\n",
    "        t3 <- d1.MoveToOuter((a + b) * (a + c.cuda()))\n",
    "\n",
    "    t3.print()\n",
    "    printf \"%b\\n\" t3.IsInvalid\n",
    "\n",
    "printf \"%b\\n\" t3.IsInvalid"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Placing Model Parameters on the GPU\n",
    "\n",
    "To use a GPU, tensors have to be copied or moved there. When you train, your data preparation logic is responsible for getting data to the GPU, but we also need the weights there. TorchSharp supports this by defining a 'to()' method on Modules, which can be used to move (not copy) the weights the model relies on to the GPU (or back to the CPU). We haven't looked at models yet, but keep this in mind for later:\n",
    "\n",
    "```F#\n",
    "let model = ...\n",
    "model.``to``(torch.CUDA)\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".NET (C#)",
   "language": "C#",
   "name": ".net-csharp"
  },
  "language_info": {
   "name": "C#"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
